AITopics | phone recognition

In recent years, self-supervised learning (SSL) models have produced promising results in a variety of speech-processing tasks, especially in contexts of data scarcity. In this paper, we study the use of SSL models for the task of mispronunciation detection for second language learners. We compare two downstream approaches: 1) training the model for phone recognition (PR) using native English data, and 2) training a model directly for the target task using non-native English data. We compare the performance of these two approaches for various SSL representations as well as a representation extracted from a traditional DNN-based speech recognition model. We evaluate the models on L2Arctic and EpaDB, two datasets of non-native speech annotated with pronunciation labels at the phone level. Overall, we find that using a downstream model trained for the target task gives the best performance and that most upstream models perform similarly for the task.

detection, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2307.16324

Country:

North America > United States (0.28)
South America > Argentina > Pampas > Buenos Aires F.D. > Buenos Aires (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.94)
Information Technology > Artificial Intelligence > Natural Language (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.67)

Add feedback

Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine

Neural Information Processing SystemsApr-6-2023, 13:37:08 GMT

Straightforward application of Deep Belief Nets (DBNs) to acoustic modeling produces a rich distributed representation of speech data that is useful for recognition and yields impressive results on the speaker-independent TIMIT phone recognition task. However, the first-layer Gaussian-Bernoulli Restricted Boltzmann Machine (GRBM) has an important limitation, shared with mixtures of diagonal-covariance Gaussians: GRBMs treat different components of the acoustic input vector as conditionally independent given the hidden state. The mean-covariance restricted Boltzmann machine (mcRBM), first introduced for modeling natural images, is a much more representationally efficient and powerful way of modeling the covariance structure of speech data. Every configuration of the precision units of the mcRBM specifies a different precision matrix for the conditional distribution over the acoustic space. In this work, we use the mcRBM to learn features of speech data that serve as input into a standard DBN. The mcRBM features combined with DBNs allow us to achieve a phone error rate of 20.5\%, which is superior to all published results on speaker-independent TIMIT to date.

artificial intelligence, machine learning, restricted boltzmann machine, (5 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.90)

Add feedback

Compressing Transformer-based self-supervised models for speech processing

Lin, Tzu-Quan, Yang, Tsung-Huan, Chang, Chun-Yao, Chen, Kuang-Ming, Feng, Tzu-hsun, Lee, Hung-yi, Tang, Hao

arXiv.org Artificial IntelligenceNov-17-2022

Despite the success of Transformers in self-supervised learning with applications to various downstream tasks, the computational cost of training and inference remains a major challenge for applying these models to a wide spectrum of devices. Several isolated attempts have been made to compress Transformers, prior to applying them to downstream tasks. In this work, we aim to provide context for the isolated results, studying several commonly used compression techniques, including weight pruning, head pruning, low-rank approximation, and knowledge distillation. We report wall-clock time, the number of parameters, and the number of multiply-accumulate operations for these techniques, charting the landscape of compressing Transformer-based self-supervised models.

machine learning, natural language, pruning, (16 more...)

arXiv.org Artificial Intelligence

2211.09949

Country: Asia > Taiwan (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

On Compressing Sequences for Self-Supervised Speech Models

Meng, Yen, Chen, Hsuan-Jui, Shi, Jiatong, Watanabe, Shinji, Garcia, Paola, Lee, Hung-yi, Tang, Hao

arXiv.org Artificial IntelligenceOct-25-2022

Compressing self-supervised models has become increasingly necessary, as self-supervised models become larger. While previous approaches have primarily focused on compressing the model size, shortening sequences is also effective in reducing the computational cost. In this work, we study fixed-length and variable-length subsampling along the time axis in self-supervised learning. We explore how individual downstream tasks are sensitive to input frame rates. Subsampling while training self-supervised models not only improves the overall performance on downstream tasks under certain frame rates, but also brings significant speed-up in inference. Variable-length subsampling performs particularly well under low frame rates. In addition, if we have access to phonetic boundaries, we find no degradation in performance for an average frame rate as low as 10 Hz.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2210.07189

Country:

Asia > Taiwan (0.05)
North America > Canada > Quebec > Montreal (0.04)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
(2 more...)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)

Add feedback

Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine

Dahl, George, Ranzato, Marc', aurelio, Mohamed, Abdel-rahman, Hinton, Geoffrey E.

Neural Information Processing SystemsFeb-15-2020, 00:43:46 GMT

Straightforward application of Deep Belief Nets (DBNs) to acoustic modeling produces a rich distributed representation of speech data that is useful for recognition and yields impressive results on the speaker-independent TIMIT phone recognition task. However, the first-layer Gaussian-Bernoulli Restricted Boltzmann Machine (GRBM) has an important limitation, shared with mixtures of diagonal-covariance Gaussians: GRBMs treat different components of the acoustic input vector as conditionally independent given the hidden state. The mean-covariance restricted Boltzmann machine (mcRBM), first introduced for modeling natural images, is a much more representationally efficient and powerful way of modeling the covariance structure of speech data. Every configuration of the precision units of the mcRBM specifies a different precision matrix for the conditional distribution over the acoustic space. In this work, we use the mcRBM to learn features of speech data that serve as input into a standard DBN.

mean-covariance restricted boltzmann machine, phone recognition, restricted boltzmann machine, (2 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.90)

Add feedback

Attention model for articulatory features detection

Karaulov, Ievgen, Tkanov, Dmytro

arXiv.org Machine LearningJul-2-2019

Articulatory distinctive features, as well as phonetic transcription, play important role in speech-related tasks: computer-assisted pronunciation training, text-to-speech conversion (TTS), studying speech production mechanisms, speech recognition for low-resourced languages. End-to-end approaches to speech-related tasks got a lot of traction in recent years. We apply Listen, Attend and Spell~(LAS)~\cite{Chan-LAS2016} architecture to phones recognition on a small small training set, like TIMIT~\cite{TIMIT-1992}. Also, we introduce a novel decoding technique that allows to train manners and places of articulation detectors end-to-end using attention models. We also explore joint phones recognition and articulatory features detection in multitask learning setting.

machine learning, natural language, recognition, (20 more...)

arXiv.org Machine Learning

1907.01914

Country:

Europe (0.28)
Asia > India (0.15)

Genre: Research Report (0.83)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.69)

Add feedback

Phoneme Recognition with Large Hierarchical Reservoirs

Triefenbach, Fabian, Jalalvand, Azarakhsh, Schrauwen, Benjamin, Martens, Jean-pierre

Neural Information Processing SystemsDec-31-2010

Automatic speech recognition has gradually improved over the years, but the reliable recognition of unconstrained speech is still not within reach. In order to achieve a breakthrough, many research groups are now investigating new methodologies that have potential to outperform the Hidden Markov Model technology that is at the core of all present commercial systems. In this paper, it is shown that the recently introduced concept of Reservoir Computing might form the basis of such a methodology. In a limited amount of time, a reservoir system that can recognize the elementary sounds of continuous speech has been built. The system already achieves a state-of-the-art performance, and there is evidence that the margin for further improvements is still significant.

neuron, recognition, reservoir, (14 more...)

Neural Information Processing Systems

Country:

North America > United States (0.14)
Europe > Belgium > Flanders > East Flanders > Ghent (0.04)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.89)

Add feedback

Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine

Dahl, George, Ranzato, Marc', aurelio, Mohamed, Abdel-rahman, Hinton, Geoffrey E.

Neural Information Processing SystemsDec-31-2010

Straightforward application of Deep Belief Nets (DBNs) to acoustic modeling produces a rich distributed representation of speech data that is useful for recognition and yields impressive results on the speaker-independent TIMIT phone recognition task. However, the first-layer Gaussian-Bernoulli Restricted Boltzmann Machine (GRBM) has an important limitation, shared with mixtures of diagonal-covariance Gaussians: GRBMs treat different components of the acoustic input vector as conditionally independent given the hidden state. The mean-covariance restricted Boltzmann machine (mcRBM), first introduced for modeling natural images, is a much more representationally efficient and powerful way of modeling the covariance structure of speech data. Every configuration of the precision units of the mcRBM specifies a different precision matrix for the conditional distribution over the acoustic space. In this work, we use the mcRBM to learn features of speech data that serve as input into a standard DBN. The mcRBM features combined with DBNs allow us to achieve a phone error rate of 20.5\%, which is superior to all published results on speaker-independent TIMIT to date.

artificial intelligence, machine learning, recognition, (16 more...)

Neural Information Processing Systems

Country: